Computer Science - Machine Learning

Papers

Diffusion Beats Autoregressive in Data-Constrained Settings

Autoregressive (AR) models have long dominated the landscape of large language models, driving progress across a wide range of tasks. Recently, diffusion-based language models have emerged as a promis...

Subliminal Learning: Language models transmit behavioral traits via hidden signals in data

We study subliminal learning, a surprising phenomenon where language models transmit behavioral traits via semantically unrelated data. In our main experiments, a "teacher" model with some trait T (su...

Large Language Models and Emergence: A Complex Systems Perspective

Emergence is a concept in complexity science that describes how many-body systems manifest novel higher-level properties, properties that can be described by replacing high-dimensional mechanisms with...

Transformers are Efficient Compilers, Provably

Transformer-based large language models (LLMs) have demonstrated surprisingly robust performance across a wide range of language-related tasks, including programming language understanding and generat...

Fast and Simplex: 2-Simplicial Attention in Triton

Recent work has shown that training loss scales as a power law with both model size and the number of tokens, and that achieving compute-optimal models requires scaling model size and token count toge...
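As a rough illustration of the power-law scaling the snippet alludes to, here is a minimal sketch of a Chinchilla-style parametric loss; the function name is made up and the constants roughly follow the published Chinchilla fit, used purely for illustration rather than taken from this paper.

```python
def power_law_loss(N, D, E=1.69, A=406.4, B=410.7, alpha=0.34, beta=0.28):
    """Chinchilla-style parametric fit: pretraining loss falls as a power law
    in model size N (parameters) and token count D. Constants roughly follow
    the published Chinchilla estimates and are illustrative only."""
    return E + A / N**alpha + B / D**beta

# Scaling model size and token count together is what moves the loss
# most per unit of compute under this functional form.
print(power_law_loss(N=7e10, D=1.4e12))
print(power_law_loss(N=1.4e11, D=2.8e12))
```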

Simple linear attention language models balance the recall-throughput tradeoff

Recent work has shown that attention-based language models excel at recall, the ability to ground generations in tokens previously seen in context. However, the efficiency of attention-based models is...
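A minimal sketch of the trade-off the snippet describes, assuming a Katharopoulos-style causal linear attention with a simple ReLU feature map: a fixed-size running state replaces the growing KV cache, buying throughput at some cost to recall. The function, feature map, and shapes are illustrative assumptions, not this paper's architecture.

```python
import numpy as np

def causal_linear_attention(Q, K, V, feature=lambda x: np.maximum(x, 0) + 1e-6):
    """Causal linear attention: a running (d x d_v) state replaces the growing
    KV cache, giving constant memory per step at some cost to recall."""
    Qf, Kf = feature(Q), feature(K)
    N, d = Q.shape
    S = np.zeros((d, V.shape[-1]))   # running sum of outer products k_t v_t^T
    z = np.zeros(d)                  # running sum of k_t (normalizer)
    out = np.zeros_like(V)
    for t in range(N):
        S += np.outer(Kf[t], V[t])
        z += Kf[t]
        out[t] = (Qf[t] @ S) / (Qf[t] @ z + 1e-6)
    return out

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((128, 64)) for _ in range(3))
print(causal_linear_attention(Q, K, V).shape)   # (128, 64)
```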

Dimension Mixer: A Generalized Method for Structured Sparsity in Deep Neural Networks

The recent success of multiple neural architectures like CNNs, Transformers, and MLP-Mixers motivated us to look for similarities and differences between them. We found that these architectures can be...

How Much Knowledge Can You Pack Into the Parameters of a Language Model?

It has recently been observed that neural language models trained on unstructured text can implicitly store and retrieve knowledge using natural language queries. In this short paper, we measure the p...

Continuous Thought Machines

Biological brains demonstrate complex neural activity, where the timing and interplay between neurons are critical to how brains process information. Most deep learning architectures simplify neural ac...

The Diffusion Duality

Uniform-state discrete diffusion models hold the promise of fast text generation due to their inherent ability to self-correct. However, they are typically outperformed by autoregressive models and ma...

Comment on The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity

Shojaee et al. (2025) report that Large Reasoning Models (LRMs) exhibit "accuracy collapse" on planning puzzles beyond certain complexity thresholds. We demonstrate that their findings primarily refle...

Chain-of-Thought Reasoning is a Policy Improvement Operator

Large language models have astounded the world with fascinating new capabilities. However, they currently lack the ability to teach themselves new skills, relying instead on large amounts of human-gen...

General agents need world models

Are world models a necessary ingredient for flexible, goal-directed behaviour, or is model-free learning sufficient? We provide a formal answer to this question, showing that any agent capable of gene...

Reasoning with Language Model is Planning with World Model

Large language models (LLMs) have shown remarkable reasoning capabilities, especially when prompted to generate intermediate reasoning steps (e.g., Chain-of-Thought, CoT). However, LLMs can still stru...

Compute-Optimal LLMs Provably Generalize Better With Scale

Why do larger language models generalize better? To investigate this question, we develop generalization bounds on the pretraining objective of large language models (LLMs) in the compute-optimal regi...

Large Language Model Compression with Global Rank and Sparsity Optimization

Low-rank and sparse composite approximation is a natural idea to compress Large Language Models (LLMs). However, such an idea faces two primary challenges that adversely affect the performance of exis...

The Complexity Dynamics of Grokking

We investigate the phenomenon of generalization through the lens of compression. In particular, we study the complexity dynamics of neural networks to explain grokking, where networks suddenly transit...

Deep Unsupervised Learning using Nonequilibrium Thermodynamics

A central problem in machine learning involves modeling complex data-sets using highly flexible families of probability distributions in which learning, sampling, inference, and evaluation are still a...

Mechanistic Design and Scaling of Hybrid Architectures

The development of deep learning architectures is a resource-demanding process, due to a vast design space, long prototyping times, and high compute costs associated with at-scale model training and e...

Geometric Deep Learning: Grids, Groups, Graphs, Geodesics, and Gauges

The last decade has witnessed an experimental revolution in data science and machine learning, epitomised by deep learning methods. Indeed, many high-dimensional learning tasks previously thought to b...

Trade-offs in Data Memorization via Strong Data Processing Inequalities

Recent research demonstrated that training large language models involves memorization of a significant fraction of training data. Such memorization can lead to privacy violations when training on sen...

FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

Transformers are slow and memory-hungry on long sequences, since the time and memory complexity of self-attention are quadratic in sequence length. Approximate attention methods have attempted to addr...
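For reference, a minimal NumPy sketch of the standard attention computation whose quadratic cost the snippet refers to; the full (N, N) score matrix materialized here is what an IO-aware kernel like FlashAttention avoids by computing the softmax in tiles. Names and shapes are illustrative.

```python
import numpy as np

def naive_attention(Q, K, V):
    """Reference softmax attention. The (N, N) score matrix S is materialized
    in full, which is the quadratic memory cost that tiled, IO-aware kernels
    avoid while producing the same exact output."""
    d = Q.shape[-1]
    S = Q @ K.T / np.sqrt(d)                       # (N, N) scores: O(N^2) memory
    P = np.exp(S - S.max(axis=-1, keepdims=True))  # numerically stable softmax
    P /= P.sum(axis=-1, keepdims=True)
    return P @ V                                   # (N, d) output

N, d = 1024, 64
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))
print(naive_attention(Q, K, V).shape)  # (1024, 64)
```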

AceReason-Nemotron: Advancing Math and Code Reasoning through Reinforcement Learning

Despite recent progress in large-scale reinforcement learning (RL) for reasoning, the training recipe for building high-performing reasoning models remains elusive. Key implementation details of front...

Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning

Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a powerful approach to enhancing the reasoning capabilities of Large Language Models (LLMs), while its mechanisms are not yet well ...
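A minimal sketch of the per-token entropy the title refers to, assuming entropy is computed over the model's next-token distribution at each position; the 80th-percentile cutoff and the shapes are illustrative assumptions, not the paper's recipe.

```python
import numpy as np

def token_entropy(logits):
    """Per-position entropy of the next-token distribution. Informally, the
    'high-entropy minority tokens' are positions where this value is large."""
    z = logits - logits.max(axis=-1, keepdims=True)
    p = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    return -(p * np.log(p + 1e-12)).sum(axis=-1)

rng = np.random.default_rng(0)
logits = rng.standard_normal((8, 32000))        # 8 positions, 32k vocab (illustrative)
H = token_entropy(logits)
high_entropy_mask = H > np.quantile(H, 0.8)     # top-20% entropy positions
print(H.round(2), high_entropy_mask)
```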

REASONING GYM: Reasoning Environments for Reinforcement Learning with Verifiable Rewards

We introduce Reasoning Gym (RG), a library of reasoning environments for reinforcement learning with verifiable rewards. It provides over 100 data generators and verifiers spanning multiple domains in...

Learning to Model the World with Language

To interact with humans and act in the world, agents need to understand the range of language that people use and relate it to the visual world. While current agents can learn to execute simple langua...

Hardware-Efficient Attention for Fast Decoding

LLM decoding is bottlenecked for large batches and long contexts by loading the key-value (KV) cache from high-bandwidth memory, which inflates per-token latency, while the sequential nature of decodi...
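A back-of-the-envelope sketch of the bandwidth bottleneck the snippet describes: the bytes of KV cache a decoder must stream from memory per generated token. The model shape below is hypothetical, not taken from the paper.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, dtype_bytes=2):
    """Rough size of the key-value cache read at every decode step:
    two tensors (K and V) per layer, one per KV head, over the full context."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * dtype_bytes

# e.g. a hypothetical 32-layer model with 8 KV heads of dim 128 at 16k context
gb = kv_cache_bytes(32, 8, 128, 16_384) / 1e9
print(f"{gb:.1f} GB of KV cache streamed per decoded token, per sequence")
```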

FP8 Formats for Deep Learning

FP8 is a natural progression for accelerating deep learning training and inference beyond the 16-bit formats common in modern processors. In this paper we propose an 8-bit floating point (FP8) binary inte...
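As a quick illustration of what an 8-bit float can represent, a sketch that computes the largest normal value of an IEEE-754-style format from its exponent and mantissa widths; the comment about E4M3 topping out at 448 reflects the paper's proposed deviation from strict IEEE conventions.

```python
def ieee_like_max(exp_bits, man_bits):
    """Largest normal value of an IEEE-754-style binary format with the given
    widths (bias = 2**(exp_bits-1) - 1, all-ones exponent reserved)."""
    bias = 2 ** (exp_bits - 1) - 1
    max_exp = (2 ** exp_bits - 2) - bias
    return (2 - 2 ** -man_bits) * 2.0 ** max_exp

print(ieee_like_max(5, 2))   # E5M2: 57344.0, follows IEEE conventions
print(ieee_like_max(4, 3))   # 240.0 under strict IEEE rules; the proposed E4M3
                             # reclaims most special encodings and reaches 448
```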

Reinforcement Learning Finetunes Small Subnetworks in Large Language Models

Reinforcement learning (RL) yields substantial improvements in the downstream task performance of large language models (LLMs) and in their alignment with human values. Surprisingly, such large gains result from upda...

Deep Reinforcement Learning, a textbook

Deep reinforcement learning has garnered much attention recently. Impressive results were achieved in activities as diverse as autonomous driving, game playing, molecular recombination, and robotics. ...

Large Language Diffusion Models

Autoregressive models (ARMs) are widely regarded as the cornerstone of large language models (LLMs). We challenge this notion by introducing LLaDA, a diffusion model trained from scratch under the pre...

Absolute Zero: Reinforced Self-play Reasoning with Zero Data

Reinforcement learning with verifiable rewards (RLVR) has shown promise in enhancing the reasoning capabilities of large language models by learning directly from outcome-based rewards. Recent RLVR wo...

Reinforcing Multi-Turn Reasoning in LLM Agents via Turn-Level Credit Assignment

This paper investigates approaches to enhance the reasoning capabilities of Large Language Model (LLM) agents using Reinforcement Learning (RL). Specifically, we focus on multi-turn tool-use scenarios...

Gluon: Making Muon & Scion Great Again! (Bridging Theory and Practice of LMO-based Optimizers for LLMs)

Recent developments in deep learning optimization have brought about radically new algorithms based on the Linear Minimization Oracle (LMO) framework, such as Muon and Scion. After over a ...

Visual Planning: Let's Think Only with Images

Recent advancements in Large Language Models (LLMs) and their multimodal extensions (MLLMs) have substantially enhanced machine reasoning across diverse tasks. However, these models predominantly rely...

The Platonic Representation Hypothesis

We argue that representations in AI models, particularly deep networks, are converging. First, we survey many examples of convergence in the literature: over time and across multiple domains, the ways...

Round and Round We Go! What makes Rotary Positional Encodings useful?

Positional Encodings (PEs) are a critical component of Transformer-based Large Language Models (LLMs), providing the attention mechanism with important sequence-position information. One of the most p...
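A minimal NumPy sketch of the rotary encoding itself, using the even/odd channel-pairing convention: each pair of channels is rotated by an angle proportional to the token position, so attention scores between rotated queries and keys depend on relative offsets. Real implementations differ in layout and fuse this into the query/key projections.

```python
import numpy as np

def rope(x, positions, base=10000.0):
    """Apply rotary positional encoding to x of shape (N, d), pairing even and
    odd channels and rotating each pair by position * inverse frequency."""
    d = x.shape[-1]
    inv_freq = base ** (-np.arange(0, d, 2) / d)      # (d/2,) rotation frequencies
    theta = positions[:, None] * inv_freq[None, :]    # (N, d/2) angles
    cos, sin = np.cos(theta), np.sin(theta)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

q = np.random.default_rng(0).standard_normal((16, 64))   # 16 tokens, head dim 64
print(rope(q, np.arange(16)).shape)                      # (16, 64)
```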

Efficient Memory Management for Large Language Model Serving with PagedAttention

High throughput serving of large language models (LLMs) requires batching sufficiently many requests at a time. However, existing systems struggle because the key-value cache (KV cache) memory for eac...
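A toy sketch of the paging idea behind PagedAttention: per-sequence block tables map logical token positions to fixed-size physical KV blocks drawn from a shared pool, so memory is allocated on demand rather than reserved per sequence. The class, block size, and method names are illustrative, not vLLM's actual interface.

```python
BLOCK_SIZE = 16  # illustrative number of tokens per physical KV block

class BlockTable:
    def __init__(self, free_blocks):
        self.free = free_blocks          # pool of physical block ids on the GPU
        self.blocks = []                 # logical block index -> physical block id

    def append_token(self, seq_len):
        """Allocate a new physical block only when the last one fills up."""
        if seq_len % BLOCK_SIZE == 0:
            self.blocks.append(self.free.pop())

    def physical_slot(self, pos):
        """Map a logical token position to (physical block, offset in block)."""
        return self.blocks[pos // BLOCK_SIZE], pos % BLOCK_SIZE

pool = list(range(1000))                 # shared pool of physical KV blocks
table = BlockTable(pool)
for t in range(40):                      # decode 40 tokens for one sequence
    table.append_token(t)
print(table.blocks, table.physical_slot(37))
```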

Iteratively reweighted kernel machines efficiently learn sparse functions

The impressive practical performance of neural networks is often attributed to their ability to learn low-dimensional data representations and hierarchical structure directly from data. In this work, ...

Mechanism of feature learning in deep fully connected networks and kernel machines that recursively learn features

In recent years neural networks have achieved impressive results on many technological and scientific tasks. Yet, the mechanism through which these models automatically select features, or patterns in...

Voyager: An Open-Ended Embodied Agent with Large Language Models

We introduce Voyager, the first LLM-powered embodied lifelong learning agent in Minecraft that continuously explores the world, acquires diverse skills, and makes novel discoveries without human inter...

Denoising Diffusion Probabilistic Models

We present high quality image synthesis results using diffusion probabilistic models, a class of latent variable models inspired by considerations from nonequilibrium thermodynamics. Our best results ...
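A minimal sketch of the closed-form forward (noising) process these models are built on, q(x_t | x_0) = N(sqrt(alpha_bar_t) x_0, (1 - alpha_bar_t) I); the linear beta schedule endpoints follow the paper, while the shapes and names are illustrative.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)       # linear variance schedule from the paper
alpha_bar = np.cumprod(1.0 - betas)

def q_sample(x0, t, rng):
    """Noise a clean sample x0 directly to diffusion step t in closed form."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps, eps

rng = np.random.default_rng(0)
x0 = rng.standard_normal((32, 32, 3))    # a stand-in "image"
x_t, eps = q_sample(x0, t=500, rng=rng)  # the denoiser is trained to predict eps
print(x_t.shape)
```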

Learning high-level visual representations from a child's perspective without strong inductive biases

Young children develop sophisticated internal models of the world based on their visual experience. Can such models be learned from a child's visual experience without strong inductive biases? To inve...

Mechanism and Emergence of Stacked Attention Heads in Multi-Layer Transformers

In this paper, I introduce the retrieval problem, a simple yet common reasoning task that can be solved only by transformers with a minimum number of layers, which grows logarithmically with the input...

Scaling Laws for Precision

Low precision training and inference affect both the quality and cost of language models, but current scaling laws do not account for this. In this work, we devise "precision-aware" scaling laws for b...

FBI-LLM: Scaling Up Fully Binarized LLMs from Scratch via Autoregressive Distillation

This work presents a Fully BInarized Large Language Model (FBI-LLM), demonstrating for the first time how to train a large-scale binary language model from scratch (not the partial binary or ternary L...

Diffusion Models are Evolutionary Algorithms

In a convergence of machine learning and biology, we reveal that diffusion models are evolutionary algorithms. By considering evolution as a denoising process and reversed evolution as diffusion, we m...

Similarity of Neural Network Representations Revisited

Recent work has sought to understand the behavior of neural networks by comparing representations between layers and between different trained models. We examine methods for comparing neural network r...
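One similarity index examined in this line of work is linear centered kernel alignment (CKA); a minimal sketch follows, with random matrices standing in for real layer activations.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between two representation matrices (n examples x features):
    ||Y^T X||_F^2 / (||X^T X||_F ||Y^T Y||_F) after centering each feature."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    num = np.linalg.norm(Y.T @ X, "fro") ** 2
    den = np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")
    return num / den

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 64))
Q, _ = np.linalg.qr(rng.standard_normal((64, 64)))
print(linear_cka(A, A))                               # 1.0: identical representations
print(linear_cka(A, A @ Q))                           # ~1.0: invariant to rotations
print(linear_cka(A, rng.standard_normal((100, 32))))  # much lower for unrelated features
```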

Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture

This paper demonstrates an approach for learning highly semantic image representations without relying on hand-crafted data-augmentations. We introduce the Image-based Joint-Embedding Predictive Archi...

Critical Tokens Matter: Token-Level Contrastive Estimation Enhances LLM's Reasoning Capability

Mathematical reasoning tasks pose significant challenges for large language models (LLMs) because they require precise logical deduction and sequence analysis. In this work, we introduce the concept o...

A mathematical theory of semantic development in deep neural networks

An extensive body of empirical research has revealed remarkable regularities in the acquisition, organization, deployment, and neural representation of human semantic knowledge, thereby raising a fund...

Progress measures for grokking via mechanistic interpretability

Neural networks often exhibit emergent behavior, where qualitatively new capabilities arise from scaling up the number of parameters, training data, or training steps. One approach to understanding em...

Towards Automated Circuit Discovery for Mechanistic Interpretability

Through considerable effort and intuition, several recent works have reverse-engineered nontrivial behaviors of transformer models. This paper systematizes the mechanistic interpretability process the...

On the Emergence of Thinking in LLMs I: Searching for the Right Intuition

Recent advancements in AI, such as OpenAI’s new o models, Google’s Gemini Thinking model, and Deepseek R1, are transforming LLMs into LRMs (Large Reasoning Models). Unlike LLMs, LRMs perform thinking ...